SUMMER OLYMPICS ANALYSIS FROM 1896 TO 2016¶
The "Summer Olympics Analysis from 1896 to 2016" involves a comprehensive examination of the data related to the Summer Olympic Games over this period. This analysis can cover various aspects, including participation, performance, trends, and insights into the evolution of the games. Below are some key areas of analysis:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
About Data¶
df=pd.read_csv('athlete_events.csv') # Data set from kaggle
region_df=pd.read_csv('noc_regions.csv')
df.tail()
| ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 271111 | 135569 | Andrzej ya | M | 29.0 | 179.0 | 89.0 | Poland-1 | POL | 1976 Winter | 1976 | Winter | Innsbruck | Luge | Luge Mixed (Men)'s Doubles | NaN |
| 271112 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Individual | NaN |
| 271113 | 135570 | Piotr ya | M | 27.0 | 176.0 | 59.0 | Poland | POL | 2014 Winter | 2014 | Winter | Sochi | Ski Jumping | Ski Jumping Men's Large Hill, Team | NaN |
| 271114 | 135571 | Tomasz Ireneusz ya | M | 30.0 | 185.0 | 96.0 | Poland | POL | 1998 Winter | 1998 | Winter | Nagano | Bobsleigh | Bobsleigh Men's Four | NaN |
| 271115 | 135571 | Tomasz Ireneusz ya | M | 34.0 | 185.0 | 96.0 | Poland | POL | 2002 Winter | 2002 | Winter | Salt Lake City | Bobsleigh | Bobsleigh Men's Four | NaN |
df.shape
(271116, 15)
df = df[df['Season'] == 'Summer']
df.shape
(222552, 15)
region_df.tail()
| NOC | region | notes | |
|---|---|---|---|
| 225 | YEM | Yemen | NaN |
| 226 | YMD | Yemen | South Yemen |
| 227 | YUG | Serbia | Yugoslavia |
| 228 | ZAM | Zambia | NaN |
| 229 | ZIM | Zimbabwe | NaN |
region_df.shape
(230, 3)
df = df.merge(region_df,on='NOC',how='left')
df.columns
Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
dtype='object')
Columns Deitals¶
- ID
Description: This is a unique identifier for each athlete in the dataset. It helps in tracking individual records and ensuring data accuracy. Use: Can be used to differentiate between athletes with the same name or other similar attributes.
- Name
Description: The full name of the athlete. Use: Useful for identifying specific athletes and could be used in analyses focused on the performance of individual athletes or for visualization purposes.
- Sex
Description: Gender of the athlete (Male/Female). Use: Crucial for gender-based analysis, such as studying gender participation trends or comparing performances between male and female athletes.
- Age
Description: Age of the athlete during the Olympic event. Use: Useful for analyzing the age distribution of athletes, identifying trends in the ages of medal winners, or studying the impact of age on performance.
- Height
Description: Height of the athlete in centimeters. Use: Can be used to analyze the relationship between height and performance in different sports or events.
- Weight
Description: Weight of the athlete in kilograms. Use: Similar to height, this can be used to study the correlation between an athlete's weight and their performance, especially in sports where physical attributes are critical.
- Team
Description: The team or country the athlete represented. Use: Essential for country-based analysis, such as comparing the performance of different countries or studying the geographic distribution of medals.
- NOC
Description: National Olympic Committee code, a three-letter abbreviation representing the athlete's country. Use: Used to identify the country in a standardized format, helpful in aggregating and comparing data across countries.
- Games
Description: The specific Olympic Games in which the athlete participated (e.g., "2000 Summer"). Use: Useful for filtering data based on specific Olympic Games, analyzing trends over time, or comparing performances across different Olympics.
- Year
Description: The year in which the Olympic Games took place. Use: Allows for time-series analysis, such as studying trends across different years or the evolution of specific sports.
- Season
Description: Indicates whether the event is part of the Summer or Winter Olympics. Since your project is about the Summer Olympics, this will likely always be "Summer." Use: While this may not vary in your dataset, it's useful for filtering and confirming that all data pertains to the correct season.
- City
Description: The city where the Olympics were held. Use: Useful for location-based analysis, such as comparing the performance of athletes in different host cities or studying the impact of hosting on local athletes.
- Sport
Description: The broader category of sport in which the athlete competed (e.g., "Athletics"). Use: Allows for analysis at the sport level, such as comparing participation rates, medal counts, or trends within specific sports.
- Event
Description: The specific event within the sport (e.g., "100m sprint"). Use: Useful for detailed analysis within sports, such as comparing performances across different events or identifying events with the highest competition.
- Medal
Description: Indicates the type of medal won by the athlete (Gold, Silver, Bronze, or None). Use: Central to medal analysis, including identifying top-performing countries, athletes, or sports, and studying trends in medal distribution.
- Region
Description: The broader region associated with the athlete's country (e.g., "Europe"). Use: Enables regional analysis, such as comparing performance across different regions or studying the impact of regional factors on athletic success.
- Notes
Description: Any additional information or remarks related to the athlete or event. Use: This column may contain miscellaneous data that could be useful for specific cases or deeper insights.
Column wise Analysis¶
1. ID¶
df['ID'].value_counts().sum()
222552
The 'ID' column is a unique identifier for each athlete, but it doesn't provide any meaningful information for analysis. It doesn't contribute to understanding trends, patterns, or relationships within the data. So lets drop it
df.drop(['ID'],axis=1,inplace=True)
df.columns
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
dtype='object')
2. Name¶
df['Name'].value_counts().sum()
222552
df['Name'].isnull().sum()
0
3. Sex¶
df['Sex'].value_counts()
Sex M 163109 F 59443 Name: count, dtype: int64
df['Sex'].isnull().sum()
0
4. Age¶
df['Age'].value_counts()
Age
23.0 17521
24.0 17200
22.0 16852
25.0 15733
21.0 15519
...
77.0 2
84.0 1
96.0 1
10.0 1
97.0 1
Name: count, Length: 74, dtype: int64
df['Age'].unique()
array([24., 23., 34., 18., 26., 31., 30., 22., 28., 32., 54., 27., 43.,
47., 21., 25., 20., 33., 41., 45., 49., 53., 57., nan, 19., 29.,
38., 35., 16., 37., 42., 46., 40., 15., 36., 17., 14., 39., 48.,
52., 55., 50., 44., 71., 63., 51., 58., 60., 75., 65., 13., 56.,
64., 68., 84., 12., 72., 59., 61., 70., 74., 62., 67., 69., 73.,
66., 76., 88., 11., 96., 80., 10., 81., 77., 97.])
df['Age'].isnull().sum()
9189
5. Height¶
df['Height'].value_counts()
Height
170.0 9696
180.0 9589
175.0 8192
178.0 8162
168.0 6588
...
226.0 3
131.0 2
130.0 2
219.0 2
128.0 1
Name: count, Length: 95, dtype: int64
df['Height'].unique()
array([180., 170., nan, 168., 172., 159., 175., 189., 163., 187., 174.,
194., 182., 185., 177., 167., 176., 164., 198., 165., 156., 179.,
178., 166., 186., 190., 183., 181., 188., 155., 169., 160., 173.,
171., 192., 195., 184., 191., 203., 158., 154., 153., 152., 193.,
162., 197., 157., 207., 147., 150., 202., 161., 210., 205., 200.,
148., 208., 140., 213., 204., 151., 196., 146., 201., 132., 149.,
199., 211., 215., 139., 206., 135., 142., 217., 209., 145., 221.,
144., 214., 143., 218., 127., 223., 216., 141., 212., 131., 136.,
137., 138., 220., 130., 219., 133., 128., 226.])
df['Height'].isnull().sum()
51857
6.Weight¶
df['Weight'].value_counts()
Weight
70.0 7659
60.0 6389
75.0 6298
68.0 5794
65.0 5735
...
149.0 1
180.0 1
48.5 1
190.0 1
151.0 1
Name: count, Length: 219, dtype: int64
df['Weight'].unique().shape
(220,)
df['Weight'].isnull().sum()
53854
7.Team¶
df['Team'].value_counts()
Team
United States 14526
Great Britain 10309
France 10021
Italy 8051
Germany 7378
...
Souvenance 1
Crabe I-11 1
Crabe I-3 1
Crabe I-2 1
Digby 1
Name: count, Length: 1157, dtype: int64
df['Team'].unique().shape
(1157,)
df['Team'].isnull().sum()
0
8.NOC¶
df['NOC'].value_counts()
NOC
USA 15064
GBR 10917
FRA 10633
ITA 8217
GER 7622
...
YMD 5
SSD 3
NBO 2
UNK 2
NFL 1
Name: count, Length: 230, dtype: int64
df['NOC'].unique().shape
(230,)
df['NOC'].isnull().sum()
0
9.Games¶
df['Games'].value_counts()
Games 2000 Summer 13821 1996 Summer 13780 2016 Summer 13688 2008 Summer 13602 2004 Summer 13443 1992 Summer 12977 2012 Summer 12920 1988 Summer 12037 1972 Summer 10304 1984 Summer 9454 1976 Summer 8641 1968 Summer 8588 1952 Summer 8270 1960 Summer 8119 1964 Summer 7702 1980 Summer 7191 1936 Summer 6506 1948 Summer 6405 1924 Summer 5233 1956 Summer 5127 1928 Summer 4992 1920 Summer 4292 1912 Summer 4040 1908 Summer 3101 1932 Summer 2969 1900 Summer 1936 1906 Summer 1733 1904 Summer 1301 1896 Summer 380 Name: count, dtype: int64
df['Games'].unique().shape
(29,)
df['Games'].isnull().sum()
0
10.Year¶
df['Year'].value_counts()
Year 2000 13821 1996 13780 2016 13688 2008 13602 2004 13443 1992 12977 2012 12920 1988 12037 1972 10304 1984 9454 1976 8641 1968 8588 1952 8270 1960 8119 1964 7702 1980 7191 1936 6506 1948 6405 1924 5233 1956 5127 1928 4992 1920 4292 1912 4040 1908 3101 1932 2969 1900 1936 1906 1733 1904 1301 1896 380 Name: count, dtype: int64
df['Year'].unique().shape
(29,)
df['Year'].isnull().sum()
0
11.Seasion¶
df['Season'].value_counts()
Season Summer 222552 Name: count, dtype: int64
#As we know the that we are working in only Summer Olympics data, we can drop this column
df.drop(columns=['Season'],axis=1,inplace=True)
df.columns
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
'Year', 'City', 'Sport', 'Event', 'Medal', 'region', 'notes'],
dtype='object')
12. City¶
df['City'].value_counts()
City London 22426 Athina 15556 Sydney 13821 Atlanta 13780 Rio de Janeiro 13688 Beijing 13602 Barcelona 12977 Los Angeles 12423 Seoul 12037 Munich 10304 Montreal 8641 Mexico City 8588 Helsinki 8270 Roma 8119 Tokyo 7702 Moskva 7191 Paris 7169 Berlin 6506 Amsterdam 4992 Melbourne 4829 Stockholm 4338 Antwerpen 4292 St. Louis 1301 Name: count, dtype: int64
df['City'].unique().shape
(23,)
df['City'].isnull().sum()
0
13. Sport¶
df['Sport'].value_counts()
Sport Athletics 38624 Gymnastics 26707 Swimming 23195 Shooting 11448 Cycling 10859 Fencing 10735 Rowing 10595 Wrestling 7154 Football 6745 Sailing 6586 Equestrianism 6344 Canoeing 6171 Boxing 6047 Hockey 5417 Basketball 4536 Weightlifting 3937 Water Polo 3846 Judo 3801 Handball 3665 Art Competitions 3578 Volleyball 3404 Tennis 2862 Diving 2842 Archery 2334 Table Tennis 1955 Modern Pentathlon 1677 Badminton 1457 Synchronized Swimming 909 Baseball 894 Rhythmic Gymnastics 658 Taekwondo 606 Beach Volleyball 564 Triathlon 529 Softball 478 Rugby Sevens 299 Golf 247 Tug-Of-War 170 Rugby 162 Trampolining 152 Polo 95 Lacrosse 60 Ice Hockey 60 Figure Skating 54 Cricket 24 Croquet 19 Motorboating 17 Racquets 12 Jeu De Paume 11 Roque 4 Alpinism 4 Basque Pelota 2 Aeronautics 1 Name: count, dtype: int64
df['Sport'].unique().shape
(52,)
df['Sport'].isnull().sum()
0
14. Event¶
df['Event'].value_counts()
Event
Football Men's Football 5733
Hockey Men's Hockey 3958
Water Polo Men's Water Polo 3358
Basketball Men's Basketball 3280
Cycling Men's Road Race, Individual 2947
...
Archery Men's Target Archery, 50 metres, Individual 2
Sailing Mixed 18 foot 2
Croquet Mixed Doubles 2
Basque Pelota Men's Two-Man Teams With Cesta 2
Aeronautics Mixed Aeronautics 1
Name: count, Length: 651, dtype: int64
df['Event'].unique().shape
(651,)
df['Event'].isnull().sum()
0
15.Medal¶
df['Medal'].value_counts()
Medal Gold 11459 Bronze 11409 Silver 11220 Name: count, dtype: int64
df['Medal'].unique().shape
(4,)
df['Medal'].isnull().sum()
188464
16.region¶
df['region'].value_counts()
region
USA 15064
Germany 12377
UK 10917
France 10633
Russia 8855
...
Kiribati 11
Brunei 10
Timor-Leste 8
Kosovo 8
South Sudan 3
Name: count, Length: 205, dtype: int64
df['region'].unique().shape
(206,)
df['region'].isnull().sum()
370
17. notes¶
df['notes'].value_counts()
notes Yugoslavia 2046 Hong Kong 673 Trinidad and Tobago 368 Serbia and Montenegro 300 Virgin Islands 239 Bohemia 153 Antigua and Barbuda 133 United Arab Republic 123 Individual Olympic Athletes 93 Australasia 86 Netherlands Antilles 74 Turks and Caicos Islands 42 West Indies Federation 20 Refugee Olympic Team 12 North Yemen 11 Crete 11 Tuvalu 7 South Yemen 5 North Borneo 2 Unknown 2 Newfoundland 1 Name: count, dtype: int64
df['notes'].unique().shape
(22,)
df['notes'].isnull().sum()
218151
Medal Tally¶
df['region'].unique().shape
(206,)
df.isnull().sum()
Name 0 Sex 0 Age 9189 Height 51857 Weight 53854 Team 0 NOC 0 Games 0 Year 0 City 0 Sport 0 Event 0 Medal 188464 region 370 notes 218151 dtype: int64
- As hear we see their are no missing values in NOC columns and thair are 370 missing values in region columns,It is due thair are some columns where NOC is not converted to region in the dataset
df.duplicated().sum()
1385
df.drop_duplicates(inplace=True)
df.duplicated().sum()
0
df['Medal'].value_counts()
Medal Gold 11456 Bronze 11409 Silver 11212 Name: count, dtype: int64
df['Medal'].unique()
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)
pd.get_dummies(df['Medal'])
| Bronze | Gold | Silver | |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | True | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 222547 | False | False | False |
| 222548 | False | False | False |
| 222549 | False | False | False |
| 222550 | False | False | False |
| 222551 | False | False | False |
221167 rows × 3 columns
# concatinate with original dataframe
df=pd.concat([df,pd.get_dummies(df['Medal'])],axis=1) # One hot incoding
df.dtypes
Name object Sex object Age float64 Height float64 Weight float64 Team object NOC object Games object Year int64 City object Sport object Event object Medal object region object notes object Bronze bool Gold bool Silver bool dtype: object
df[['Bronze', 'Gold', 'Silver']] = df[['Bronze', 'Gold', 'Silver']].astype(int)
df.head()
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN | 0 | 0 | 0 |
| 1 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN | 0 | 0 | 0 |
| 2 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN | 0 | 0 | 0 |
| 3 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN | 0 | 1 | 0 |
| 4 | Cornelia "Cor" Aalten (-Strannood) | F | 18.0 | 168.0 | NaN | Netherlands | NED | 1932 Summer | 1932 | Los Angeles | Athletics | Athletics Women's 100 metres | NaN | Netherlands | NaN | 0 | 0 | 0 |
df.shape
(221167, 18)
df.shape
(221167, 18)
# groupby with NOC
df.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
| NOC | Gold | Silver | Bronze | |
|---|---|---|---|---|
| 0 | USA | 2472 | 1333 | 1197 |
| 1 | URS | 832 | 635 | 596 |
| 2 | GBR | 635 | 729 | 620 |
| 3 | GER | 592 | 538 | 649 |
| 4 | ITA | 518 | 474 | 454 |
| ... | ... | ... | ... | ... |
| 225 | AHO | 0 | 1 | 0 |
| 226 | LBR | 0 | 0 | 0 |
| 227 | LCA | 0 | 0 | 0 |
| 228 | LES | 0 | 0 | 0 |
| 229 | LBA | 0 | 0 | 0 |
230 rows × 4 columns
- It is a wrong output because if a country win a gold medal in a perticular game that means its gold medal count is one,but if the perticular event has number of players then each players win gold medal each thats why number of gold medal is showing much more and its same for all medal.
- Below one example for india
df[(df['NOC']=='IND') & (df['Medal'] == 'Gold')]
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4186 | Shaukat Ali | M | 30.0 | NaN | NaN | India | IND | 1928 Summer | 1928 | Amsterdam | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 4190 | Syed Mushtaq Ali | M | 22.0 | 165.0 | 61.0 | India | IND | 1964 Summer | 1964 | Tokyo | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 4460 | Richard James Allen | M | 25.0 | 172.0 | NaN | India | IND | 1928 Summer | 1928 | Amsterdam | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 4461 | Richard James Allen | M | 30.0 | 172.0 | NaN | India | IND | 1932 Summer | 1932 | Los Angeles | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 4462 | Richard James Allen | M | 34.0 | 172.0 | NaN | India | IND | 1936 Summer | 1936 | Berlin | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191759 | Dung Dung Sylvanus | M | 31.0 | 160.0 | 62.0 | India | IND | 1980 Summer | 1980 | Moskva | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 193964 | Carlyle Carrol Tapsell | M | 23.0 | 182.0 | NaN | India | IND | 1932 Summer | 1932 | Los Angeles | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 193965 | Carlyle Carrol Tapsell | M | 27.0 | 182.0 | NaN | India | IND | 1936 Summer | 1936 | Berlin | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 205949 | Max "Maxie" Vaz | M | NaN | NaN | NaN | India | IND | 1948 Summer | 1948 | London | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
| 219336 | Sayed Muhammad Yusuf | M | NaN | NaN | NaN | India | IND | 1928 Summer | 1928 | Amsterdam | Hockey | Hockey Men's Hockey | Gold | India | NaN | 0 | 1 | 0 |
131 rows × 18 columns
- For Correct this we drop duplicates from following columns ['Team','NOC','Games','Year','City','Sport','Event','Medal']
medal_tally=df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'])
medal_tally.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
| NOC | Gold | Silver | Bronze | |
|---|---|---|---|---|
| 0 | USA | 1035 | 802 | 708 |
| 1 | URS | 394 | 317 | 294 |
| 2 | GBR | 278 | 317 | 300 |
| 3 | GER | 235 | 261 | 283 |
| 4 | FRA | 234 | 256 | 287 |
| ... | ... | ... | ... | ... |
| 225 | AHO | 0 | 1 | 0 |
| 226 | LBR | 0 | 0 | 0 |
| 227 | LCA | 0 | 0 | 0 |
| 228 | LES | 0 | 0 | 0 |
| 229 | LBA | 0 | 0 | 0 |
230 rows × 4 columns
medal_tally[medal_tally['NOC'] == 'IND']
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 436 | S. Abdul Hamid | M | NaN | NaN | NaN | India | IND | 1928 Summer | 1928 | Amsterdam | Athletics | Athletics Men's 110 metres Hurdles | NaN | India | NaN | 0 | 0 | 0 |
| 437 | S. Abdul Hamid | M | NaN | NaN | NaN | India | IND | 1928 Summer | 1928 | Amsterdam | Athletics | Athletics Men's 400 metres Hurdles | NaN | India | NaN | 0 | 0 | 0 |
| 790 | Shiny Kurisingal Abraham-Wilson | F | 19.0 | 167.0 | 53.0 | India | IND | 1984 Summer | 1984 | Los Angeles | Athletics | Athletics Women's 800 metres | NaN | India | NaN | 0 | 0 | 0 |
| 791 | Shiny Kurisingal Abraham-Wilson | F | 19.0 | 167.0 | 53.0 | India | IND | 1984 Summer | 1984 | Los Angeles | Athletics | Athletics Women's 4 x 400 metres Relay | NaN | India | NaN | 0 | 0 | 0 |
| 792 | Shiny Kurisingal Abraham-Wilson | F | 23.0 | 167.0 | 53.0 | India | IND | 1988 Summer | 1988 | Seoul | Athletics | Athletics Women's 800 metres | NaN | India | NaN | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216942 | Vikas Krishan Yadav | M | 24.0 | 177.0 | 69.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Boxing | Boxing Men's Middleweight | NaN | India | NaN | 0 | 0 | 0 |
| 216985 | Mohammad Anas Yahiya | M | 21.0 | 177.0 | 69.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Athletics | Athletics Men's 400 metres | NaN | India | NaN | 0 | 0 | 0 |
| 218411 | Thyadathuvilla Chandrapillai "T. C." Yohannan | M | 29.0 | 174.0 | 62.0 | India | IND | 1976 Summer | 1976 | Montreal | Athletics | Athletics Men's Long Jump | NaN | India | NaN | 0 | 0 | 0 |
| 222416 | Geeta Zutshi | F | 23.0 | 167.0 | 51.0 | India | IND | 1980 Summer | 1980 | Moskva | Athletics | Athletics Women's 800 metres | NaN | India | NaN | 0 | 0 | 0 |
| 222417 | Geeta Zutshi | F | 27.0 | 167.0 | 51.0 | India | IND | 1984 Summer | 1984 | Los Angeles | Athletics | Athletics Women's 3,000 metres | NaN | India | NaN | 0 | 0 | 0 |
687 rows × 18 columns
medal_tally = medal_tally.groupby('region').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
medal_tally['total'] = medal_tally['Gold'] + medal_tally['Silver'] + medal_tally['Bronze']
medal_tally
| region | Gold | Silver | Bronze | total | |
|---|---|---|---|---|---|
| 0 | USA | 1035 | 802 | 708 | 2545 |
| 1 | Russia | 592 | 498 | 487 | 1577 |
| 2 | Germany | 444 | 457 | 491 | 1392 |
| 3 | UK | 278 | 317 | 300 | 895 |
| 4 | France | 234 | 256 | 287 | 777 |
| ... | ... | ... | ... | ... | ... |
| 200 | Lesotho | 0 | 0 | 0 | 0 |
| 201 | Albania | 0 | 0 | 0 | 0 |
| 202 | Libya | 0 | 0 | 0 | 0 |
| 203 | Liechtenstein | 0 | 0 | 0 | 0 |
| 204 | Liberia | 0 | 0 | 0 | 0 |
205 rows × 5 columns
In Which Year Olympics was Played¶
years = df['Year'].unique().tolist()
years.sort()
years.insert(0,'Overall')
years
['Overall', 1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016]
- Till 2016 , Summer Olympics was played 29 times staring from 1896 to 2016
Which Country Participate in Summer Olympics¶
country = np.unique(df['region'].dropna().values).tolist()
country.sort()
country
['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Antigua', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Boliva', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guam', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Individual Olympic Athletes', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Korea', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Congo', 'Romania', 'Russia', 'Rwanda', 'Saint Kitts', 'Saint Lucia', 'Saint Vincent', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga', 'Trinidad', 'Tunisia', 'Turkey', 'Turkmenistan', 'UK', 'USA', 'Uganda', 'Ukraine', 'United Arab Emirates', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands, British', 'Virgin Islands, US', 'Yemen', 'Zambia', 'Zimbabwe']
country.insert(0,'Overall')
country
['Overall', 'Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Antigua', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Boliva', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic', 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guam', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Honduras', 'Hungary', 'Iceland', 'India', 'Individual Olympic Athletes', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'North Korea', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Congo', 'Romania', 'Russia', 'Rwanda', 'Saint Kitts', 'Saint Lucia', 'Saint Vincent', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga', 'Trinidad', 'Tunisia', 'Turkey', 'Turkmenistan', 'UK', 'USA', 'Uganda', 'Ukraine', 'United Arab Emirates', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands, British', 'Virgin Islands, US', 'Yemen', 'Zambia', 'Zimbabwe']
medal_df = df.drop_duplicates(subset=['Team', 'NOC', 'Games', 'Year', 'City', 'Sport', 'Event', 'Medal'])
medal_df.head()
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN | 0 | 0 | 0 |
| 1 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN | 0 | 0 | 0 |
| 2 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN | 0 | 0 | 0 |
| 3 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN | 0 | 1 | 0 |
| 4 | Cornelia "Cor" Aalten (-Strannood) | F | 18.0 | 168.0 | NaN | Netherlands | NED | 1932 Summer | 1932 | Los Angeles | Athletics | Athletics Women's 100 metres | NaN | Netherlands | NaN | 0 | 0 | 0 |
def fetch_medal_tally(year, country):
# Drop duplicates based on the relevant columns
medal_df = df.drop_duplicates(subset=['Team', 'NOC', 'Games', 'Year', 'City', 'Sport', 'Event', 'Medal'])
flag = 0
# Filter the DataFrame based on the input year and country
if year == 'Overall' and country == 'Overall':
temp_df = medal_df
if year == 'Overall' and country != 'Overall':
flag = 1
temp_df = medal_df[medal_df['region'] == country]
if year != 'Overall' and country == 'Overall':
temp_df = medal_df[medal_df['Year'] == int(year)]
if year != 'Overall' and country != 'Overall':
temp_df = medal_df[(medal_df['Year'] == 2016) & (medal_df['region'] == country)]
# Group and summarize the data
if flag == 1:
x = temp_df.groupby('Year').sum()[['Gold', 'Silver', 'Bronze']].sort_values('Year').reset_index()
else:
x = temp_df.groupby('region').sum()[['Gold', 'Silver', 'Bronze']].sort_values('Gold', ascending=False).reset_index()
# Calculate the total medals
x['Total'] = x['Gold'] + x['Silver'] + x['Bronze']
print(x)
fetch_medal_tally(year='Overall',country='USA')
Year Gold Silver Bronze Total 0 1896 11 6 2 19 1 1900 21 16 17 54 2 1904 76 79 76 231 3 1906 12 5 6 23 4 1908 22 12 12 46 5 1912 26 18 19 63 6 1920 41 27 27 95 7 1924 45 27 27 99 8 1928 22 18 16 56 9 1932 44 36 30 110 10 1936 24 21 12 57 11 1948 38 27 19 84 12 1952 40 19 17 76 13 1956 32 25 17 74 14 1960 34 21 16 71 15 1964 36 26 28 90 16 1968 45 28 34 107 17 1972 33 31 30 94 18 1976 34 35 25 94 19 1984 82 61 30 173 20 1988 36 31 27 94 21 1992 37 34 37 108 22 1996 44 32 25 101 23 2000 36 24 31 91 24 2004 36 39 26 101 25 2008 36 39 35 110 26 2012 46 28 29 103 27 2016 46 37 38 121
medal_df[(medal_df['Year'] == 2016) & (medal_df['region'] == 'India')]
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1015 | Sharath Kamal Achanta | M | 34.0 | 186.0 | 85.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Table Tennis | Table Tennis Men's Singles | NaN | India | NaN | 0 | 0 | 0 |
| 7065 | Seema Antil | F | 33.0 | 182.0 | 92.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Athletics | Athletics Women's Discus Throw | NaN | India | NaN | 0 | 0 | 0 |
| 8713 | Aditi Ashok | F | 18.0 | 173.0 | 57.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Golf | Golf Women's Individual | NaN | India | NaN | 0 | 0 | 0 |
| 9202 | Manu Attri | M | 23.0 | 172.0 | 73.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Badminton | Badminton Men's Doubles | NaN | India | NaN | 0 | 0 | 0 |
| 10070 | Lalita Shivaji Babar | F | 27.0 | 166.0 | 50.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Athletics | Athletics Women's 3,000 metres Steeplechase | NaN | India | NaN | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 182260 | Sathish Kumar Sivalingam | M | 24.0 | 175.0 | 77.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Weightlifting | Weightlifting Men's Middleweight | NaN | India | NaN | 0 | 0 | 0 |
| 195568 | Shiva Thapa | M | 22.0 | 169.0 | 56.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Boxing | Boxing Men's Bantamweight | NaN | India | NaN | 0 | 0 | 0 |
| 198042 | Sandeep Tomar | M | 25.0 | 168.0 | 61.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Wrestling | Wrestling Men's Featherweight, Freestyle | NaN | India | NaN | 0 | 0 | 0 |
| 216942 | Vikas Krishan Yadav | M | 24.0 | 177.0 | 69.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Boxing | Boxing Men's Middleweight | NaN | India | NaN | 0 | 0 | 0 |
| 216985 | Mohammad Anas Yahiya | M | 21.0 | 177.0 | 69.0 | India | IND | 2016 Summer | 2016 | Rio de Janeiro | Athletics | Athletics Men's 400 metres | NaN | India | NaN | 0 | 0 | 0 |
68 rows × 18 columns
OVERALL ANALYSIS¶
1. N0.OF EDITIONS¶
df.head()
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN | 0 | 0 | 0 |
| 1 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN | 0 | 0 | 0 |
| 2 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN | 0 | 0 | 0 |
| 3 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN | 0 | 1 | 0 |
| 4 | Cornelia "Cor" Aalten (-Strannood) | F | 18.0 | 168.0 | NaN | Netherlands | NED | 1932 Summer | 1932 | Los Angeles | Athletics | Athletics Women's 100 metres | NaN | Netherlands | NaN | 0 | 0 | 0 |
df['Year'].unique().shape
(29,)
- Till 2016 There 28 time olympics was conducted, But as per our data its showing 29.Because we all know that olympics conducted every 4 years of interval but after 1904 its next sceducle year is 1906.But IOC not recognised 1906 event as a olympics game.
2.NO OF CITIES¶
df['City'].unique().shape
(23,)
- Till 2016, 23 cities host summer olympics
3.No of events/sports¶
df['Sport'].unique().shape
(52,)
df['Event'].unique().shape
(651,)
- Till 2016, 651 Event conducted in 52 Sports
4.NO OF ATHELETES¶
df['Name'].unique().shape
(116122,)
- Total 116122 Atheletes Participate
5.No OF PARTICIPATING NATIONS¶
df['region'].unique().shape
(206,)
- Total 206 Nation Participate in summer olympics till 2016
Nation over Time¶
df.head()
| Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | City | Sport | Event | Medal | region | notes | Bronze | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Barcelona | Basketball | Basketball Men's Basketball | NaN | China | NaN | 0 | 0 | 0 |
| 1 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | London | Judo | Judo Men's Extra-Lightweight | NaN | China | NaN | 0 | 0 | 0 |
| 2 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Antwerpen | Football | Football Men's Football | NaN | Denmark | NaN | 0 | 0 | 0 |
| 3 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold | Denmark | NaN | 0 | 1 | 0 |
| 4 | Cornelia "Cor" Aalten (-Strannood) | F | 18.0 | 168.0 | NaN | Netherlands | NED | 1932 Summer | 1932 | Los Angeles | Athletics | Athletics Women's 100 metres | NaN | Netherlands | NaN | 0 | 0 | 0 |
nations_over_time=df.drop_duplicates(['Year','region'])['Year'].value_counts().reset_index().sort_values('Year')
nations_over_time
| Year | count | |
|---|---|---|
| 28 | 1896 | 12 |
| 22 | 1900 | 31 |
| 27 | 1904 | 14 |
| 26 | 1906 | 20 |
| 25 | 1908 | 22 |
| 24 | 1912 | 29 |
| 23 | 1920 | 29 |
| 21 | 1924 | 45 |
| 20 | 1928 | 46 |
| 19 | 1932 | 47 |
| 18 | 1936 | 49 |
| 17 | 1948 | 59 |
| 16 | 1952 | 67 |
| 15 | 1956 | 71 |
| 13 | 1960 | 83 |
| 11 | 1964 | 93 |
| 10 | 1968 | 111 |
| 9 | 1972 | 120 |
| 12 | 1976 | 91 |
| 14 | 1980 | 80 |
| 8 | 1984 | 139 |
| 7 | 1988 | 156 |
| 6 | 1992 | 168 |
| 5 | 1996 | 196 |
| 4 | 2000 | 199 |
| 3 | 2004 | 200 |
| 2 | 2008 | 202 |
| 1 | 2012 | 203 |
| 0 | 2016 | 204 |
nations_over_time.rename(columns={'Year':'Edition','count':'No of Countries'},inplace=True)
import plotly.express as px
fig = px.line(nations_over_time,x='Edition',y='No of Countries')
fig.show()
- In 1896 its only 12 nation participate but in 2016 it jump upto 204
No.of Events over time (Every Sport)¶
import seaborn as sns
import matplotlib.pyplot as plt
x = df.drop_duplicates(['Year','Sport','Event'])
plt.figure(figsize=(20,20))
sns.heatmap(x.pivot_table(index='Sport',columns='Year',values='Event',aggfunc='count').fillna(0).astype('int'),annot=True)
<Axes: xlabel='Year', ylabel='Sport'>
- In Athletics, the number of Events are increasing from 12 in 1896 to 47 in 2016
- Baseball was played 5 times in Summer Olympics history from 1992 to 2008 with 1 event each time
- Boxing is continuously increasing with 13 events in 2016
- Cricket was played only one time in the year 1900
- Cycling has 18 events each in the last 5 Olympics
- Except in 1896 and 1932 football was played in every Olympics with 2 events in the last 6 Olympics
- Golf played 3 times i.e 1900,1904 and 2016 with 2 events in each Olympics
- Gymnastics played in every Olympics, in the last 15 Olympics gymnastics was played in 14 different category
- Except for 1st Olympics Rowing was played in every Olympics
- Swimming played in every Olympics with 34 categories in 2016
- Volleyball was added in 1964 with 2 events and continuing the same
- Weightlifting and Wrestling two famous sports in Olympics history with 15 and 18 event in 2016 respectively
Most successful Athletes¶
def most_successful(df,sport):
temp_df = df.dropna(subset=['Medal'])
if sport != 'Overall':
temp_df = temp_df[temp_df]['Sport'] == sport
return temp_df['Name'].value_counts().reset_index()
most_successful(df,'Overall')
| Name | count | |
|---|---|---|
| 0 | Michael Fred Phelps, II | 28 |
| 1 | Larysa Semenivna Latynina (Diriy-) | 18 |
| 2 | Nikolay Yefimovich Andrianov | 15 |
| 3 | Borys Anfiyanovych Shakhlin | 13 |
| 4 | Takashi Ono | 13 |
| ... | ... | ... |
| 24540 | Joel Isasi Gonzlez | 1 |
| 24541 | Florena Andreea Isrescu | 1 |
| 24542 | Ryoji Isaoka | 1 |
| 24543 | Preben Isaksson | 1 |
| 24544 | rpd Lengyel | 1 |
24545 rows × 2 columns
def most_successful(df, sport):
# Drop rows where 'Medal' is NaN
temp_df = df.dropna(subset=['Medal'])
# Filter by sport if it's not 'Overall'
if sport != 'Overall':
temp_df = temp_df[temp_df['Sport'] == sport]
# Get the top 15 names by medal count
top_names = temp_df['Name'].value_counts().reset_index()
top_names.columns = ['Name', 'count']
# Merge with relevant columns from the original DataFrame
result = top_names.merge(df[['Name', 'Sport', 'region']], on='Name', how='left').drop_duplicates()
# Ensure to return relevant columns
return result[['Name', 'count', 'Sport', 'region']]
most_successful(df,'Overall').head(15)
| Name | count | Sport | region | |
|---|---|---|---|---|
| 0 | Michael Fred Phelps, II | 28 | Swimming | USA |
| 30 | Larysa Semenivna Latynina (Diriy-) | 18 | Gymnastics | Russia |
| 49 | Nikolay Yefimovich Andrianov | 15 | Gymnastics | Russia |
| 73 | Borys Anfiyanovych Shakhlin | 13 | Gymnastics | Russia |
| 97 | Takashi Ono | 13 | Gymnastics | Japan |
| 129 | Takashi Ono | 13 | Judo | Japan |
| 130 | Edoardo Mangiarotti | 13 | Fencing | Italy |
| 144 | Dara Grace Torres (-Hoffman, -Minas) | 12 | Swimming | USA |
| 157 | Aleksey Yuryevich Nemov | 12 | Gymnastics | Russia |
| 178 | Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) | 12 | Swimming | USA |
| 195 | Birgit Fischer-Schmidt | 12 | Canoeing | Germany |
| 208 | Ryan Steven Lochte | 12 | Swimming | USA |
| 222 | Paavo Johannes Nurmi | 12 | Athletics | Finland |
| 234 | Sawao Kato | 12 | Gymnastics | Japan |
| 258 | Natalie Anne Coughlin (-Hall) | 12 | Swimming | USA |
- In the history of summer olympics Michael Fred Phelps from USA win 28 medal in Swimming.
Country wise¶
1. Countrywise medal tally per year(line plot)¶
temp_df=df.dropna(subset=['Medal'])
temp_df.drop_duplicates(subset=['Team','NOC','Games','Year','City','Sport','Event','Medal'],inplace=True)
new_df = temp_df[temp_df['region'] == 'India']
final_df=new_df.groupby('Year').count()['Medal'].reset_index()
fig = px.line(final_df,x='Year',y='Medal')
fig.show()
- India win 6 medal in the 2012 olympics
2.What country are good at heatmap¶
new_df = temp_df[temp_df['region'] == 'Russia']
plt.figure(figsize=(20,20))
sns.heatmap(new_df.pivot_table(index='Sport',columns='Year',values='Medal',aggfunc='count').fillna(0),annot=True)
<Axes: xlabel='Year', ylabel='Sport'>
- Russia won 41 medals in Athletics in the year 1980, which is the best achievement in a particular sport in a particular year in Summer Olympics history for Russia
- For Russia Athletics, Gymnastics and wrestling are the top three sports
- The USA won 65 medals in Athletics in the year 1904, which is the best achievement in a particular sport in a particular year in Summer Olympics history for the USA
- For the USA Atheletics and Swiming are the top two sports
- For India Hockey is the best sports
3. Most succesful Athletes(Top 10)¶
def most_successful(df, country):
# Drop rows where 'Medal' is NaN
temp_df = df.dropna(subset=['Medal'])
# Filter by sport if it's not 'Overall'
if country != 'Overall':
temp_df = temp_df[temp_df['region'] == country]
# Get the top 15 names by medal count
top_names = temp_df['Name'].value_counts().reset_index()
top_names.columns = ['Name', 'count']
# Merge with relevant columns from the original DataFrame
result = top_names.merge(df[['Name', 'Sport',]], on='Name', how='left').drop_duplicates()
# Ensure to return relevant columns
return result[['Name', 'count', 'Sport', ]]
most_successful(df,'China').head(10)
| Name | count | Sport | |
|---|---|---|---|
| 0 | Yang Wei | 7 | Badminton |
| 3 | Yang Wei | 7 | Gymnastics |
| 26 | Wu Minxia | 7 | Diving |
| 33 | Wang Yifu | 6 | Shooting |
| 43 | Li Ning | 6 | Gymnastics |
| 59 | Zou Kai | 6 | Gymnastics |
| 67 | Sun Yang | 6 | Swimming |
| 77 | Wang Hao | 6 | Table Tennis |
| 83 | Wang Hao | 6 | Athletics |
| 84 | Wang Hao | 6 | Diving |
- For India, Udham Singh Kular and Leslie Walter Claudius are the two best Athelet with 4 medals each in Hockey
- For Russia, Larysa Semenivna Latyninar is the best Athelet with 18 medals in Gymnastics
- For Italy, Edoardo Mangiarotti is the best player with 13 medals in Fencing
- For Japan, Takashi Ono is the best player with 13 medals in Gymnastics
- For Germany, Birgit Fischer-Schmidt is the best player with 12 medals in Canoeing
- For the UK, Henry Taylor with 8 medals in swimming
- For Finland, Paavo Johannes Nurmi with 12 medals in Athletics
- For China, Yang Wei won 7 medals each in Badminton and Gymnastics
Athlete Wise Analysis¶
1. Age vs Medal¶
import plotly.figure_factory as ff
athlete_df=df.drop_duplicates(subset=['Name','region'])
x1 = athlete_df['Age'].dropna()
x2 = athlete_df[athlete_df['Medal'] == 'Gold']['Age'].dropna()
x3 = athlete_df[athlete_df['Medal'] == 'Silver']['Age'].dropna()
x4 = athlete_df[athlete_df['Medal'] == 'Bronze']['Age'].dropna()
fig = ff.create_distplot([x1,x2,x3,x4],['Ovreall Age','Gold Medalist','Silver Medalist','Bronze Medalist'],show_hist=False,show_rug=False)
fig.show()
- When an Athelets age is 10 to 17 and the athlete is entering a particular Sport, Then his chance to win a Silver Medal is more than that of Gold Medal
- When an Athelet age is 21 to 25 Then his chance to win a Gold Medal is more than that of Silver Medal
- When an Athelet age is between 26 to 32 Then his chance to win a Silver Medal is more than that of Gold Medal
- When an Athelet age is 33 to 36 Then his chance to win a Gold Medal is more than that of Silver Medal
- When an Athletes age is 22-23 his probability to win a medal is maximum
2.Sport vs Age¶
famous_sports = ['Basketball', 'Judo', 'Football', 'Tug-Of-War', 'Athletics',
'Swimming', 'Badminton', 'Sailing', 'Gymnastics',
'Art Competitions', 'Handball', 'Weightlifting', 'Wrestling',
'Water Polo', 'Hockey', 'Rowing', 'Fencing',
'Shooting', 'Boxing', 'Taekwondo', 'Cycling', 'Diving', 'Canoeing',
'Tennis', 'Golf', 'Softball', 'Archery',
'Volleyball', 'Synchronized Swimming', 'Table Tennis', 'Baseball',
'Rhythmic Gymnastics', 'Rugby Sevens',
'Beach Volleyball', 'Triathlon', 'Rugby', 'Polo',
'Ice Hockey']
x = []
name = []
for sport in famous_sports:
temp_df = athlete_df[athlete_df['Sport'] == sport]
x.append(temp_df[temp_df['Medal']=='Gold']['Age'].dropna())
name.append(sport)
fig =ff.create_distplot(x,name,show_hist=False,show_rug=False)
fig.show()
- For Rhythmic Gymnastics, when an athlete is in age between 15-and 20 the chance to win a gold medal is maximum
- For Beach Volleyball, when an athlete is aged between 25 and 30 the chance to win a gold medal is maximum
- For an Art competition one athlete can play up to 59 years
- For Polo chance for wining a gold medal is 33 to 36
- For Cycling 20 - 23 is the best years
- One Athelitc can play Shooting up to 60 years
- One Athelitc can play Water Polo for up to 40 years and its peak year is 22-24
- As a comparison between Polo and Water polo, age is a vital factor for water polo
Height vs Weight¶
athlete_df['Medal'].fillna('No Medal',inplace=True)
plt.figure(figsize=(10,10))
temp_df = athlete_df[athlete_df['Sport'] == 'Weightlifting'] # Compare with a string, not a list
sns.scatterplot(x=temp_df['Weight'], y=temp_df['Height'], hue=temp_df['Medal'], style=temp_df['Sex'], s=100)
plt.show() # Optional: To display the plot
- In Wrestling Males is more than female
- In Weightlifting female is more than that of male
Men vs Women Participation¶
men = athlete_df [athlete_df ['Sex'] == 'M'].groupby('Year').count()['Name'].reset_index()
women = athlete_df[athlete_df['Sex'] == 'F'].groupby('Year').count()['Name'].reset_index()
final = men.merge(women,on='Year',how='left')
final.rename(columns={'Name_x':'Male','Name_y':'Female'},inplace = True)
final.fillna(0,inplace=True)
fig = px.line(final,x='Year',y=['Male','Female'])
fig.show()
- When Summer Olympics started, Female participation is 0, But as this event is going on female participation is continously incresing and first female paricipation is in 1900 and it is 23 number
conclusion¶
- Till 2016, the Summer Olympics was played 28 times starting from 1896 to 2016
- Till 2016, 23 cities hosted the Summer Olympics.
- Till 2016, 651 Event conducted in 52 Sports
- A total of 116122 Athletes Participate
- A total of 206 nations participated in the Summer Olympics till 2016
- In 1896 only 12 nations participated but in 2016 it jumped up to 204
- In Athletics, the number of Events are increasing from 12 in 1896 to 47 in 2016
- Baseball was played 5 times in Summer Olympics history from 1992 to 2008 with 1 event each time
- Boxing is continuously increasing with 13 events in 2016
- Cricket was played only one time in the year 1900
- Cycling has 18 events each in the last 5 Olympics
- Except in 1896 and 1932 football was played in every Olympics with 2 events in the last 6 Olympics
- Golf played 3 times i.e 1900,1904 and 2016 with 2 events in each Olympics
- Gymnastics is played in every Olympics, in the last 15 Olympics gymnastics was played in 14 different category
- Except for 1st Olympics Rowing was played in every Olympics
- Swimming played in every Olympics with 34 categories in 2016
- Volleyball was added in 1964 with 2 events and continued the same till 2016
- Weightlifting and Wrestling are two famous sports in Olympic history with 15 and 18 events in 2016 respectively
- In the history of the Summer Olympics Michael Fred Phelps from the USA won 28 medals in Swimming and is the most successful player.
- Russia won 41 medals in Athletics in the year 1980, which is the best achievement in a particular sport in a particular year in Summer Olympics history for Russia
- For Russia Athletics, Gymnastics and wrestling are the top three sports
- The USA won 65 medals in Athletics in the year 1904, which is the best achievement in a particular sport in a particular year in Summer Olympics history for the USA
- For the USA Athletics and Swiming are the top two sports
- For India Hockey is the best sports
- For India, Udham Singh Kular and Leslie Walter Claudius are the two best Athelet with 4 medals each in Hockey
- For Russia, Larysa Semenivna Latyninar is the best Athelet with 18 medals in Gymnastics
- For Italy, Edoardo Mangiarotti is the best player with 13 medals in Fencing
- For Japan, Takashi Ono is the best player with 13 medals in Gymnastics
- For Germany, Birgit Fischer-Schmidt is the best player with 12 medals in Canoeing
- For the UK, Henry Taylor with 8 medals in swimming
- For Finland, Paavo Johannes Nurmi with 12 medals in Athletics
- For China, Yang Wei won 7 medals each in Badminton and Gymnastics
- When an Athelets age is 10 to 17 and the athlete is entering a particular Sport, Then his chance to win a Silver Medal is more than that of a Gold Medal
- When an Athelet age is 21 to 25 Then his chance to win a Gold Medal is more than that of a Silver Medal
- When an Athelet age is between 26 to 32 Then his chance to win a Silver Medal is more than that of a Gold Medal
- When an Athelet age is 33 to 36 Then his chance to win a Gold Medal is more than that of a Silver Medal
- When an Athletes age is 22-23 his probability to win a medal is maximum
- For Rhythmic Gymnastics, when an athlete is in age between 15-and 20 the chance to win a gold medal is maximum
- For Beach Volleyball, when an athlete is aged between 25 and 30 the chance to win a gold medal is maximum
- For an Art competition, one athlete can play for up to 59 years
- For Polo, the chance of winning a gold medal is 33 to 36
- For Cycling 20 - 23 is the best years
- One Athelitc can play Shooting up to 60 years
- One Athelitc can play Water Polo for up to 40 years and its peak year is 22-24
- As a comparison between Polo and Water polo, age is a vital factor for water polo
- In Wrestling Males are more than female
- In Weightlifting, females are more than male
- When the Summer Olympics started, Female participation was 0, But as this event has been going on female participation has continuously increased and first female participation was in 1900 and it is 23 number